(19) Europäisches Patentamt
     European Patent Office
     Office européen des brevets

(11) EP 0 702 815 B1

(12) EUROPEAN PATENT SPECIFICATION



(45) Date of publication and mention 
of the grant of the patent: 
23.08.2000 Bulletin 2000/34 

(21) Application number: 94921242.7

(22) Date of filing: 02.06.1994

(51) Int. Cl.7: G06F 11/14

(86) International application number:
PCT/US94/06320

(87) International publication number:
WO 94/29807 (22.12.1994 Gazette 1994/28)

(54) WRITE ANYWHERE FILE-SYSTEM LAYOUT

ANORDNUNG EINES DATEISYSTEMS ZUM BESCHREIBEN BELIEBIGER BEREICHE

DISPOSITION D'UN SYSTEME DE FICHIERS A ECRITURE DANS UNE ZONE NON
PREDETERMINEE






(84) Designated Contracting States:
AT BE CH DE DK ES FR GB GR IE IT LI LU MC NL
PT SE

(30) Priority: 03.06.1993 US 71643

(43) Date of publication of application:
27.03.1996 Bulletin 1996/13

(60) Divisional application:
99120949.5 / 1 003 103

(73) Proprietor:
Network Appliance, Inc.
Sunnyvale, California 94089 (US)

(72) Inventors:
• HITZ, David
  Sunnyvale, CA 94086 (US)
• MALCOLM, Michael
  Los Altos, CA 94022 (US)
• LAU, James
  Cupertino, CA 95014 (US)
• RAKITZIS, Byron
  Mountain View, CA 94043 (US)

(74) Representative:
Leeming, John Gerard
J.A. Kemp & Co.,
14 South Square,
Gray's Inn
London WC1R 5LX (GB)



(56) References cited:
EP-A- 0 359 384    EP-A- 0 453 193
US-A- 5 043 871    US-A- 5 043 876
US-A- 5 163 148

• B SRINIVASAN ET AL: "Recoverable file system
  for microprocessor systems"
  MICROPROCESSORS AND MICROSYSTEMS,
  vol. 9, no. 4, May 1985, LONDON GB, pages 179-183, XP002031805
• The Episode File System, USENIX, Winter 1992,
  pp. 43-59 by Sailesh Chutani et al.



Note: Within nine months from the publication of the mention of the grant of the European patent, any person may give
notice to the European Patent Office of opposition to the European patent granted. Notice of opposition shall be filed in
a written reasoned statement. It shall not be deemed to have been filed until the opposition fee has been paid. (Art.
99(1) European Patent Convention).



Description

1. FIELD OF THE INVENTION 

[0001] The present invention is related to the field of methods and
apparatus for maintaining a consistent file system and for creating
read-only copies of the file system.

2. BACKGROUND ART

[0002] All file systems must maintain consistency in spite of system
failure. A number of different consistency techniques have been used in
the prior art for this purpose.
[0003] One of the most difficult and time consuming issues in managing
any file server is making backups of file data. Traditional solutions
have been to copy the data to tape or other off-line media. With some
file systems, the file server must be taken off-line during the backup
process in order to ensure that the backup is completely consistent. A
recent advance in backup is the ability to quickly "clone" (i.e., a
prior art method for creating a read-only copy of the file system on
disk) a file system, and perform a backup from the clone instead of from
the active file system. This type of file system allows the file server
to remain on-line during the backup.

File System Consistency

[0004] A prior art file system is disclosed by Chutani, et al. in an
article entitled The Episode File System, USENIX, Winter 1992, at pages
43-59. The article describes the Episode file system, which is a file
system using meta-data (i.e., inode tables, directories, bitmaps, and
indirect blocks). It can be used as a stand-alone or as a distributed
file system. Episode supports a plurality of separate file system
hierarchies. Episode refers to the plurality of file systems
collectively as an "aggregate". In particular, Episode provides a clone
of each file system for slowly changing data.
[0005] In Episode, each logical file system contains an "anode" table.
An anode table is the equivalent of an inode table used in file systems
such as the Berkeley Fast File System. An anode is a 252-byte structure.
Anodes are used to store all user data as well as meta-data in the
Episode file system. An anode describes the root directory of a file
system including auxiliary files and directories. Each such file system
in Episode is referred to as a "fileset". All data within a fileset is
locatable by iterating through the anode table and processing each file
in turn. Episode creates a read-only copy of a file system, herein
referred to as a "clone", and shares data with the active file system
using Copy-On-Write (COW) techniques.

[0006] Episode uses a logging technique to recover a file system(s)
after a system crashes. Logging ensures that the file system meta-data
are consistent. A bitmap table contains information about whether each
block in the file system is allocated or not. Also, the bitmap table
indicates whether or not each block is logged. All meta-data updates are
recorded in a log "container" that stores the transaction log of the
aggregate. The log is processed as a circular buffer of disk blocks. The
transaction logging of Episode uses logging techniques originally
developed for databases to ensure file system consistency. This
technique uses carefully ordered writes and a recovery program that are
supplemented by database techniques in the recovery program.
[0007] Other prior art systems including JFS of IBM and VxFS of Veritas
Corporation use various forms of transaction logging to speed the
recovery process, but still require a recovery process.
[0008] Another prior art method is called the "ordered write" technique.
It writes all disk blocks in a carefully determined order so that damage
is minimized when a system failure occurs while performing a series of
related writes. The prior art attempts to ensure that inconsistencies
that occur are harmless; for instance, a few unused blocks or inodes may
be left marked as allocated. The primary disadvantage of this technique
is that the restrictions it places on disk order make it hard to achieve
high performance.

[0009] Yet another prior art system is an elaboration of the second
prior art method referred to as an "ordered write with recovery"
technique. In this method, inconsistencies can be potentially harmful.
However, the order of writes is restricted so that inconsistencies can
be found and fixed by a recovery program. Examples of this method
include the original UNIX file system and Berkeley Fast File System
(FFS). This technique does not reduce disk ordering sufficiently to
eliminate the performance penalty of disk ordering. Another disadvantage
is that the recovery process is time consuming. It typically is
proportional to the size of the file system. Therefore, for example,
recovering a 5 GB FFS file system requires an hour or more to perform.

File System Clones 

[0010] Figure 1 is a prior art diagram for the Episode file system
illustrating the use of copy-on-write (COW) techniques for creating a
fileset clone. Anode 110 comprises a first pointer 110A having a COW bit
that is set. Pointer 110A references data block 114 directly. Anode 110
comprises a second pointer 110B having a COW bit that is cleared.
Pointer 110B of anode 110 references indirect block 112. Indirect block
112 comprises a pointer 112A that references data block 124 directly.
The COW bit of pointer 112A is set. Indirect block 112 comprises a
second pointer 112B that references data block 126. The COW bit of
pointer 112B is cleared.

[0011] A clone anode 120 comprises a first pointer 120A that references
data block 114. The COW bit of pointer 120A is cleared. The second
pointer 120B of clone anode 120 references indirect block 122. The COW
bit of pointer 120B is cleared. In turn, indirect block 122 comprises a
pointer 122A that references data block 124. The COW bit of pointer 122A
is cleared.

[0012] As illustrated in Figure 1, every direct pointer 110A, 112A-112B,
120A, and 122A and indirect pointer 110B and 120B in the Episode file
system contains a COW bit. Blocks that have not been modified are
contained in both the active file system and the clone, and have set (1)
COW bits. The COW bit is cleared (0) when a block that is referenced by
the pointer has been modified and, therefore, is part of the active file
system but not the clone.

[0013] When a copy-on-write block is modified, as shown in Figure 1, a
new block is allocated and updated. The COW flag in the pointer to this
new block is then set. The COW bit of pointer 110A of original anode 110
is cleared. Thus, when the clone anode 120 is created, pointer 120A of
clone anode 120 references data block 114 also. Both original anode 110
and clone anode 120 reference data block 114. Data block 124 has also
been modified as indicated by a cleared COW bit of pointer 112A in
original indirect block 112. Thus, when the clone anode is created,
indirect block 122 is created. Pointer 122A of indirect block 122
references data block 124, and the COW bit of pointer 122A is cleared.
Both indirect block 112 of the original anode 110 and indirect block 122
of clone anode 120 reference data block 124.

[0014] Figure 1 illustrates copying of an anode to create a clone anode
120 for a single file. However, clone anodes must be created for every
file having changed data blocks in the file system. At the time of the
clone, all inodes must be copied. Creating clone anodes for every
modified file in the file system can consume significant amounts of disk
space. Further, Episode is not capable of supporting multiple clones
since each pointer has only one COW bit. A single COW bit is not able to
distinguish more than one clone. For more than one clone, there is not a
second COW bit that can be set.

[0015] A fileset "clone" is a read-only copy of an active fileset
wherein the active fileset is readable and writable. Clones are
implemented using COW techniques, and share data blocks with an active
fileset on a block-by-block basis. Episode implements cloning by copying
each anode stored in a fileset. When initially cloned, the writable
anode of the active fileset and the cloned anode both point to the same
data block(s). However, the disk addresses for direct and indirect
blocks in the original anode are tagged as COW. Thus, an update to the
writable fileset does not affect the clone. When a COW block is
modified, a new block is allocated in the file system and updated with
the modification. The COW flag in the pointer to this new block is
cleared.

[0016] The prior art Episode system creates clones that duplicate the
entire inode file and all of the indirect blocks in the file system.
Episode duplicates all inodes and indirect blocks so that it can set a
Copy-On-Write (COW) bit in all pointers to blocks that are used by both
the active file system and the clone. In Episode, it is important to
identify these blocks so that new data written to the active file system
does not overwrite "old" data that is part of the clone and, therefore,
must not change.

[0017] Creating a clone in the prior art can use up as much as 32 MB on
a 1 GB disk. The prior art uses 256 MB of disk space on a 1 GB disk (for
4 KB blocks) to keep eight clones of the file system. Thus, the prior
art cannot use large numbers of clones to prevent loss of data. Instead,
it is used to facilitate backup of the file system onto an auxiliary
storage means other than the disk drive, such as a tape backup device.
Clones are used to back up a file system in a consistent state at the
instant the clone is made. By cloning the file system, the clone can be
backed up to the auxiliary storage means without shutting down the
active file system, and thereby preventing users from using the file
system. Thus, clones allow users to continue accessing an active file
system while the file system, in a consistent state, is backed up. Then
the clone is deleted once the backup is completed. Episode is not
capable of supporting multiple clones since each pointer has only one
COW bit. A single COW bit is not able to distinguish more than one
clone. For more than one clone, there is no second COW bit that can be
set.

[0018] A disadvantage of the prior art system for creating file system
clones is that it involves duplicating all of the inodes and all of the
indirect blocks in the file system. For a system with many small files,
the inodes alone can consume a significant percentage of the total disk
space in a file system. For example, a 1 GB file system that is filled
with 4 KB files has 32 MB of inodes. Thus, creating an Episode clone
consumes a significant amount of disk space, and generates large amounts
(i.e., many megabytes) of disk traffic. As a result of these conditions,
creating a clone of a file system takes a significant amount of time to
complete.
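
The 32 MB figure above, and the 256 MB quoted earlier for eight clones, can be checked with a short calculation. The sketch below is illustrative only: the function name is hypothetical, and the 128-byte inode size is an assumption chosen because it reproduces the stated figure.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def inode_overhead(fs_bytes, file_bytes, inode_bytes):
    """Bytes of inodes needed when a file system is filled with equal-size files."""
    n_files = fs_bytes // file_bytes
    return n_files * inode_bytes


# Assumption: 128-byte inodes (the size that yields the 32 MB figure in the text).
per_clone = inode_overhead(1 * GB, 4 * KB, 128)
eight_clones = 8 * per_clone

print(per_clone // MB)      # inode megabytes duplicated by one clone
print(eight_clones // MB)   # inode megabytes duplicated by eight clones
```

Under that assumption, one clone duplicates 262,144 inodes, which is why keeping several clones at once quickly becomes impractical.
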
[0019] Another disadvantage of the prior art system is that it makes it
difficult to create multiple clones of the same file system. The result
of this is that clones tend to be used, one at a time, for short term
operations such as backing up the file system to tape, and are then
deleted.

[0020] The present invention, which is defined in the amended claims,
provides a method for maintaining a file system in a consistent state
and for creating read-only copies of a file system. Changes to the file
system are tightly controlled to maintain the file system in a
consistent state. The file system progresses from one self-consistent
state to another self-consistent state. The set of self-consistent
blocks on disk that is rooted by the root inode is referred to as a
consistency point (CP). To implement consistency points, WAFL always
writes new data to unallocated blocks on disk. It never overwrites
existing data. A new consistency point occurs when the fsinfo block is
updated by writing a new root inode for the inode file into it. Thus, as
long as the root inode is not updated, the state of the file system
represented on disk does not change.
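
The write-anywhere rule described above can be modelled in a few lines. The following is a toy sketch, not the WAFL implementation: the class and method names are hypothetical, the "disk" is a Python list, and a single `root` field stands in for the root inode held in the fsinfo block.

```python
class WriteAnywhereStore:
    """Toy model: blocks are never overwritten; only the root pointer changes."""

    def __init__(self, n_blocks):
        self.blocks = [None] * n_blocks   # simulated disk blocks
        self.root = None                  # stands in for the root inode in fsinfo

    def _alloc(self):
        # Always choose an unallocated block, so existing data is never overwritten.
        for block_number, content in enumerate(self.blocks):
            if content is None:
                return block_number
        raise RuntimeError("disk full")

    def write_new(self, data):
        block_number = self._alloc()
        self.blocks[block_number] = data
        return block_number

    def commit(self, new_root_block):
        # The single update that creates a new consistency point.
        self.root = new_root_block

    def read_root(self):
        return None if self.root is None else self.blocks[self.root]


store = WriteAnywhereStore(8)
store.commit(store.write_new({"file": "v1"}))
pending = store.write_new({"file": "v2"})   # on disk, but not yet reachable
before = store.read_root()                  # still the old consistent state
store.commit(pending)
after = store.read_root()                   # the new consistency point
```

Until `commit` runs, a reader starting from the root still sees the previous state, which mirrors the statement that the on-disk file system does not change while the root inode is not updated.
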

[0021] The present invention also creates snapshots, which are virtual
read-only copies of the file system. A snapshot uses no disk space when
it is initially created. It is designed so that many different snapshots
can be created for the same file system. Unlike prior art file systems
that create a clone by duplicating the entire inode file and all of the
indirect blocks, the present invention duplicates only the inode that
describes the inode file. Thus, the actual disk space required for a
snapshot is only the 128 bytes used to store the duplicated inode. The
128 bytes of the present invention required for a snapshot is
significantly less than the many megabytes used for a clone in the prior
art.
[0022] The present invention prevents new data written to the active
file system from overwriting "old" data that is part of a snapshot(s).
It is necessary that old data not be overwritten as long as it is part
of a snapshot. This is accomplished by using a multi-bit free-block map.
Most prior art file systems use a free block map having a single bit per
block to indicate whether or not a block is allocated. The present
invention uses a block map having 32-bit entries. A first bit indicates
whether a block is used by the active file system, and 20 remaining bits
are used for up to 20 snapshots; however, some bits of the 31 bits may
be used for other purposes.
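
A minimal sketch of such a multi-bit block map follows. The bit positions (bit 0 for the active file system, bits 1 to 20 for snapshots), the helper names, and the rule that taking a snapshot copies the active bit into the snapshot's bit are all assumptions made for illustration; the text above fixes only the entry width and the number of snapshot bits.

```python
ACTIVE_BIT = 0            # bit 0: block is in use by the active file system
MAX_SNAPSHOTS = 20        # bits 1..20: one bit per snapshot (assumed bit order)
ENTRY_MASK = 0xFFFFFFFF   # each block map entry is 32 bits wide


def snapshot_bit(snap_id):
    """Return the bit position used for snapshot number snap_id (1..20)."""
    if not 1 <= snap_id <= MAX_SNAPSHOTS:
        raise ValueError("snapshot id out of range")
    return snap_id


def mark_active(entry):
    return (entry | (1 << ACTIVE_BIT)) & ENTRY_MASK


def clear_active(entry):
    return entry & ~(1 << ACTIVE_BIT) & ENTRY_MASK


def take_snapshot(blkmap, snap_id):
    """Assumed behaviour: copy each entry's active bit into the snapshot's bit."""
    bit = 1 << snapshot_bit(snap_id)
    return [(e | bit) if e & 1 else (e & ~bit & ENTRY_MASK) for e in blkmap]


def delete_snapshot(blkmap, snap_id):
    bit = 1 << snapshot_bit(snap_id)
    return [e & ~bit & ENTRY_MASK for e in blkmap]


def is_free(entry):
    """In this sketch a block may be reused only when no bit of its entry is set."""
    return (entry & ENTRY_MASK) == 0


blkmap = [mark_active(0), 0]           # block 0 in use, block 1 free
blkmap = take_snapshot(blkmap, 1)      # snapshot 1 now also holds block 0
blkmap[0] = clear_active(blkmap[0])    # the active file system releases block 0
```

After these steps block 0 is still not free, because snapshot 1 holds it; it becomes reusable only once that snapshot is deleted. This is the property that keeps "old" data from being overwritten.
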



BRIEF DESCRIPTION OF THE DRAWINGS

[0023]

Figure 1 is a block diagram of a prior art "clone" of a file system.

Figure 2 is a diagram illustrating a list of inodes having dirty
buffers.

Figure 3 is a diagram illustrating an on-disk inode of WAFL.

Figures 4A-4D are diagrams illustrating on-disk inodes of WAFL having
different levels of indirection.

Figure 5 is a flow diagram illustrating the method for generating a
consistency point.

Figure 6 is a flow diagram illustrating step 530 of Figure 5 for
generating a consistency point.

Figure 7 is a flow diagram illustrating step 530 of Figure 5 for
creating a snapshot.

Figure 8 is a diagram illustrating an incore inode of WAFL according to
the present invention.

Figures 9A-9D are diagrams illustrating incore inodes of WAFL having
different levels of indirection according to the present invention.

Figure 10 is a diagram illustrating an incore inode 1020 for a file.

Figures 11A-11D are diagrams illustrating a block map (blkmap) file
according to the present invention.

Figure 12 is a diagram illustrating an inode file according to the
present invention.

Figures 13A-13B are diagrams illustrating an inode map (inomap) file
according to the present invention.

Figure 14 is a diagram illustrating a directory according to the present
invention.

Figure 15 is a diagram illustrating a file system information (fsinfo)
structure.

Figure 16 is a diagram illustrating the WAFL file system.

Figures 17A-17L are diagrams illustrating the generation of a
consistency point.

Figures 18A-18C are diagrams illustrating generation of a snapshot.

Figure 19 is a diagram illustrating changes to an inode file.

Figure 20 is a diagram illustrating fsinfo blocks used for maintaining a
file system in a consistent state.

Figures 21A-21F are detailed diagrams illustrating generation of a
snapshot.

Figure 22 is a diagram illustrating an active WAFL file system having
three snapshots that each reference a common file; and

Figures 23A-23B are diagrams illustrating the updating of atime.

DETAILED DESCRIPTION OF THE PRESENT INVENTION

[0024] A system for creating read-only copies of a file system is
described. In the following description, numerous specific details, such
as number and nature of disks, disk block sizes, etc., are described in
detail in order to provide a more thorough description of the present
invention. It will be apparent, however, to one skilled in the art, that
the present invention may be practiced without these specific details.
In other instances, well-known features have not been described in
detail so as not to unnecessarily obscure the present invention.

WRITE ANYWHERE FILE-SYSTEM LAYOUT

[0025] The present invention uses a Write Anywhere File-system Layout
(WAFL). This disk format system is block based (i.e., 4 KB blocks that
have no fragments), uses inodes to describe its files, and includes
directories that are simply specially formatted files. WAFL uses files
to store meta-data that describes the layout of the file system. WAFL
meta-data files include: an inode file, a block map (blkmap) file, and
an inode map (inomap) file. The inode file contains the inode table for
the file system. The blkmap file indicates which disk blocks are
allocated. The inomap file indicates which inodes are allocated. On-disk
and incore WAFL inode distinctions are discussed below.

On-Disk WAFL Inodes 

[0026] WAFL inodes are distinct from prior art inodes. Each on-disk WAFL
inode points to 16 blocks having the same level of indirection. A block
number is 4 bytes long. Use of block numbers having the same level of
indirection in an inode better facilitates recursive processing of a
file. Figure 3 is a block diagram illustrating an on-disk inode 310. The
on-disk inode 310 is comprised of standard inode information 310A and 16
block number entries 310B having the same level of indirection. The
inode information 310A comprises information about the owner of a file,
permissions, file size, access time, etc. that are well-known to a
person skilled in the art. On-disk inode 310 is unlike prior art inodes
that comprise a plurality of block numbers having different levels of
indirection. Keeping all block number entries 310B in an inode 310 at
the same level of indirection simplifies file system implementation.
[0027] For a small file having a size of 64 bytes or less, data is
stored directly in the inode itself instead of the 16 block numbers.
Figure 4A is a diagram illustrating a Level 0 inode 410 that is similar
to inode 310 shown in Figure 3. However, inode 410 comprises 64 bytes of
data 410B instead of 16 block numbers 310B. Therefore, disk blocks do
not need to be allocated for very small files.

[0028] For a file having a size of less than 64 KB, each of the 16 block
numbers directly references a 4 KB data block. Figure 4B is a diagram
illustrating a Level 1 inode 310 comprising 16 block numbers 310B. The
block number entries 0-15 point to corresponding 4 KB data blocks
420A-420C.



[0029] For a file having a size that is greater than or equal to 64 KB
and is less than 64 MB, each of the 16 block numbers references a
single-indirect block. In turn, each 4 KB single-indirect block
comprises 1024 block numbers that reference 4 KB data blocks. Figure 4C
is a diagram illustrating a Level 2 inode 310 comprising 16 block
numbers 310B that reference 16 single-indirect blocks 430A-430C. As
shown in Figure 4C, block number entry 0 points to single-indirect block
430A. Single-indirect block 430A comprises 1024 block numbers that
reference 4 KB data blocks 440A-440C. Similarly, single-indirect blocks
430B-430C can each address up to 1024 data blocks.

[0030] For a file size greater than 64 MB, the 16 block numbers of the
inode reference double-indirect blocks. Each 4 KB double-indirect block
comprises 1024 block numbers pointing to corresponding single-indirect
blocks. In turn, each single-indirect block comprises 1024 block numbers
that point to 4 KB data blocks. Thus, up to 64 GB can be addressed.
Figure 4D is a diagram illustrating a Level 3 inode 310 comprising 16
block numbers 310B wherein block number entries 0, 1, and 15 reference
double-indirect blocks 470A, 470B, and 470C, respectively.
Double-indirect block 470A comprises 1024 block number entries 0-1023
that point to 1024 single-indirect blocks 480A-480B. Each
single-indirect block 480A-480B, in turn, references 1024 data blocks.
As shown in Figure 4D, single-indirect block 480A references 1024 data
blocks 490A-490C and single-indirect block 480B references 1024 data
blocks 490C-490F.
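
The size thresholds for the four inode levels follow directly from 16 block numbers per inode, 4-byte block numbers, and 4 KB blocks. The sketch below works through that arithmetic; the function names are hypothetical, and placing a file of exactly 64 MB at Level 3 is an assumption, since the text leaves that boundary unstated.

```python
KB, MB, GB = 1024, 1024 ** 2, 1024 ** 3
BLOCK_SIZE = 4 * KB
PTRS_PER_INODE = 16
PTRS_PER_BLOCK = BLOCK_SIZE // 4   # 4-byte block numbers: 1024 per 4 KB block


def capacity(level):
    """Maximum bytes addressable by an inode at the given level (1 to 3)."""
    return PTRS_PER_INODE * PTRS_PER_BLOCK ** (level - 1) * BLOCK_SIZE


def inode_level(size):
    """Pick the inode level for a file size, following the thresholds in the text."""
    if size <= 64:
        return 0               # data stored in the inode itself
    if size < 64 * KB:
        return 1               # 16 direct data blocks
    if size < 64 * MB:
        return 2               # 16 single-indirect blocks
    if size <= capacity(3):
        return 3               # 16 double-indirect blocks (64 MB exactly: assumed)
    raise ValueError("file too large for a Level 3 inode")


for level in (1, 2, 3):
    print(level, capacity(level))
```

Each extra level multiplies the addressable size by 1024, which gives 64 KB, 64 MB, and 64 GB for Levels 1, 2, and 3.
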

Incore WAFL Inodes 

[0031] Figure 8 is a block diagram illustrating an incore WAFL inode
820. The incore inode 820 comprises the information of on-disk inode 310
(shown in Figure 3), a WAFL buffer data structure 820A, and 16 buffer
pointers 820B. A WAFL incore inode has a size of 300 bytes. A WAFL
buffer is an incore (in memory) 4 KB equivalent of the 4 KB blocks that
are stored on disk. Incore inode 820 is unlike prior art inodes that
reference buffers having different levels of indirection. Each incore
WAFL inode 820 points to 16 buffers having the same level of
indirection. A buffer pointer is 4 bytes long. Keeping all buffer
pointers 820B in an inode 820 at the same level of indirection
simplifies file system implementation. Incore inode 820 also contains
incore information 820C comprising a dirty flag, an in-consistency point
(IN_CP) flag, and pointers for a linked list. The dirty flag indicates
that the inode itself has been modified or that it references buffers
that have changed. The IN_CP flag is used to mark an inode as being in a
consistency point (described below). The pointers for a linked list are
described below.

[0032] Figure 10 is a diagram illustrating a file referenced by a WAFL
inode 1010. The file comprises indirect WAFL buffers 1020-1024 and
direct WAFL buffers 1030-1034. The WAFL in-core inode 1010 comprises
standard inode information 1010A (including a count of dirty buffers), a
WAFL buffer data structure 1010B, 16 buffer pointers 1010C and a
standard on-disk inode 1010D. The in-core WAFL inode 1010 has a size of
approximately 300 bytes. The on-disk inode is 128 bytes in size. The
WAFL buffer data structure 1010B comprises two pointers where the first
one references the 16 buffer pointers 1010C and the second references
the on-disk block numbers 1010D.
[0033] Each inode 1010 has a count of dirty buffers that it references.
An inode 1010 can be put in the list of dirty inodes and/or the list of
inodes that have dirty buffers. When all dirty buffers referenced by an
inode are either scheduled to be written to disk or are written to disk,
the count of dirty buffers to inode 1010 is set to zero. The inode 1010
is then requeued according to its flag (i.e., no dirty buffers). This
inode 1010 is cleared before the next inode is processed. Further, the
flag of the inode indicating that it is in a consistency point is
cleared. The inode 1010 itself is written to disk in a consistency
point.

[0034] The WAFL buffer structure is illustrated by indirect WAFL buffer
1020. WAFL buffer 1020 comprises a WAFL buffer data structure 1020A, a 4
KB buffer 1020B comprising 1024 WAFL buffer pointers and a 4 KB buffer
1020C comprising 1024 on-disk block numbers. The WAFL buffer data
structure is 56 bytes in size and comprises 2 pointers. One pointer of
WAFL buffer data structure 1020A references 4 KB buffer 1020B and a
second pointer references buffer 1020C. In Figure 10, the 16 buffer
pointers 1010C of WAFL inode 1010 point to the 16 single-indirect WAFL
buffers 1020-1024. In turn, WAFL buffer 1020 references 1024 direct WAFL
buffer structures 1030-1034. WAFL buffer 1030 is representative of
direct WAFL buffers.
[0035] Direct WAFL buffer 1030 comprises WAFL buffer data structure
1030A and a 4 KB direct buffer 1030B containing a cached version of a
corresponding on-disk 4 KB data block. Direct WAFL buffer 1030 does not
comprise a 4 KB buffer such as buffer 1020C of indirect WAFL buffer
1020. The second buffer pointer of WAFL buffer data structure 1030A is
zeroed, and therefore does not point to a second 4 KB buffer. This
prevents inefficient use of memory because memory space would be
assigned for an unused buffer otherwise.
[0036] In the WAFL file system as shown in Figure 10, a WAFL in-core
inode structure 1010 references a tree of WAFL buffer structures
1020-1024 and 1030-1034. It is similar to a tree of blocks on disk
referenced by standard inodes comprising block numbers that point to
indirect and/or direct blocks. Thus, WAFL inode 1010 contains not only
the on-disk inode 1010D comprising 16 volume block numbers, but also
comprises 16 buffer pointers 1010C pointing to WAFL buffer structures
1020-1024 and 1030-1034. WAFL buffers 1030-1034 contain cached contents
of blocks referenced by volume block numbers.



[0037] The WAFL in-core inode 1010 contains 16 buffer pointers 1010C. In
turn, the 16 buffer pointers 1010C are referenced by a WAFL buffer
structure 1010B that roots the tree of WAFL buffers 1020-1024 and
1030-1034. Thus, each WAFL inode 1010 contains a WAFL buffer structure
1010B that points to the 16 buffer pointers 1010C in the inode 1010.
This facilitates algorithms for handling trees of buffers that are
implemented recursively. If the 16 buffer pointers 1010C in the inode
1010 were not represented by a WAFL buffer structure 1010B, the
recursive algorithms for operating on an entire tree of buffers
1020-1024 and 1030-1034 would be difficult to implement.
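
The benefit of rooting the tree in a buffer structure can be seen in a small sketch: because the inode's pointers are wrapped in the same kind of node as every indirect buffer, one recursive routine can walk the entire tree without a special case for the inode. The class and function names below are hypothetical, and the structures are greatly simplified relative to the 56-byte WAFL buffer data structure described above.

```python
class WaflBuffer:
    """Hypothetical stand-in for a WAFL buffer structure."""

    def __init__(self, children=None, data=None, dirty=False):
        self.children = children or []   # buffer pointers; empty for a direct buffer
        self.data = data                 # cached block contents for a direct buffer
        self.dirty = dirty


class IncoreInode:
    """Hypothetical in-core inode whose buffer pointers are held by a root buffer."""

    def __init__(self, pointers):
        # Wrapping the pointers in a WaflBuffer lets tree algorithms start here
        # and treat the inode level exactly like any indirect level.
        self.root = WaflBuffer(children=pointers)


def count_dirty(buf):
    """Recursively count dirty buffers in the tree rooted at buf."""
    return int(buf.dirty) + sum(count_dirty(child) for child in buf.children)


direct_a = WaflBuffer(data=b"a" * 4096, dirty=True)
direct_b = WaflBuffer(data=b"b" * 4096)
indirect = WaflBuffer(children=[direct_a, direct_b], dirty=True)
inode = IncoreInode(pointers=[indirect])
dirty_total = count_dirty(inode.root)
```

The same recursion serves any level of indirection, since all pointers in a given node sit at the same depth.
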

[0038] Figures 9A-9D are diagrams illustrating inodes having different
levels of indirection. In Figures 9A-9D, simplified indirect and direct
WAFL buffers are illustrated to show indirection. However, it should be
understood that the WAFL buffers of Figure 9 represent corresponding
indirect and direct buffers of Figure 10. For a small file having a size
of 64 bytes or less, data is stored directly in the inode itself instead
of the 16 buffer pointers. Figure 9A is a diagram illustrating a Level 0
inode 820 that is the same as inode 820 shown in Figure 8 except that
inode 820 comprises 64 bytes of data 920B instead of 16 buffer pointers
820B. Therefore, additional buffers are not allocated for very small
files.
[0039] For a file having a size of less than 64 KB, each of the 16
buffer pointers directly references a 4 KB direct WAFL buffer. Figure 9B
is a diagram illustrating a Level 1 inode 820 comprising 16 buffer
pointers 820B. The buffer pointers PTR0-PTR15 point to corresponding 4
KB direct WAFL buffers 922A-922C.
[0040] For a file having a size that is greater than or equal to 64 KB
and is less than 64 MB, each of the 16 buffer pointers references a
single-indirect WAFL buffer. In turn, each 4 KB single-indirect WAFL
buffer comprises 1024 buffer pointers that reference 4 KB direct WAFL
buffers. Figure 9C is a diagram illustrating a Level 2 inode 820
comprising 16 buffer pointers 820B that reference 16 single-indirect
WAFL buffers 930A-930C. As shown in Figure 9C, buffer pointer PTR0
points to single-indirect WAFL buffer 930A. Single-indirect WAFL buffer
930A comprises 1024 pointers that reference 4 KB direct WAFL buffers
940A-940C. Similarly, single-indirect WAFL buffers 930B-930C can each
address up to 1024 direct WAFL buffers.

[0041] For a file size greater than 64 MB, the 16 buffer pointers of the
inode reference double-indirect WAFL buffers. Each 4 KB double-indirect
WAFL buffer comprises 1024 pointers pointing to corresponding
single-indirect WAFL buffers. In turn, each single-indirect WAFL buffer
comprises 1024 pointers that point to 4 KB direct WAFL buffers. Thus, up
to 64 GB can be addressed. Figure 9D is a diagram illustrating a Level 3
inode 820 comprising 16 pointers 820B wherein pointers PTR0, PTR1, and
PTR15 reference double-indirect WAFL buffers 970A, 970B, and 970C,
respectively. Double-indirect WAFL buffer 970A comprises 1024 pointers
that point to 1024 single-indirect WAFL buffers 980A-980B. Each
single-indirect WAFL buffer 980A-980B, in turn, references 1024 direct
WAFL buffers. As shown in Figure 9D, single-indirect WAFL buffer 980A
references 1024 direct WAFL buffers 990A-990C and single-indirect WAFL
buffer 980B references 1024 direct WAFL buffers 990D-990F.

Directories 


[0042] Directories in the WAFL system are stored in
4 KB blocks that are divided into two sections. Figure 14
is a diagram illustrating a directory block 1410 according
to the present invention. Each directory block 1410
comprises a first section 1410A comprising fixed length
directory entry structures 1412-1414 and a second section
1410B containing the actual directory names 1416-1418.
Each directory entry also contains a file id and a
generation. This information identifies what file the entry
references. This information is well-known in the art,
and therefore is not illustrated in Figure 14. Each entry
1412-1414 in the first section 1410A of the directory
block has a pointer to its name in the second section
1410B. Further, each entry 1412-1414 includes a hash
value dependent upon its name in the second section
1410B so that the name is examined only when a hash
hit (a hash match) occurs. For example, entry 1412 of
the first section 1410A comprises a hash value 1412A
and a pointer 1412B. The hash value 1412A is a value
dependent upon the directory name
"DIRECTORY_ABC" stored in variable length entry
1416 of the second section 1410B. Pointer 1412B of
entry 1412 points to the variable length entry 1416 of
second section 1410B. Using fixed length directory
entries 1412-1414 in the first section 1410A speeds up
the process of name lookup. A calculation is not
required to find the next entry in a directory block 1410.
Further, keeping entries 1412-1414 in the first section
1410A small improves the hit rate for file systems with a
line-fill data cache.
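
The two-section layout can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the layout of an actual WAFL directory block: the specification does not name a hash function, so CRC-32 stands in for it, and the `name_len` field, the class names, and the `name_compares` counter are hypothetical.

```python
import zlib
from dataclasses import dataclass


@dataclass
class DirEntry:
    name_hash: int    # hash value derived from the name (cf. 1412A)
    name_offset: int  # pointer into the names section (cf. 1412B)
    name_len: int     # assumed field so the name can be sliced out


class DirBlock:
    """Fixed-length entries in one section, variable-length names in another."""

    def __init__(self):
        self.entries = []         # first section (cf. 1410A)
        self.names = bytearray()  # second section (cf. 1410B)
        self.name_compares = 0    # counts how often a name is examined

    def add(self, name: str) -> None:
        raw = name.encode()
        self.entries.append(DirEntry(zlib.crc32(raw), len(self.names), len(raw)))
        self.names += raw

    def lookup(self, name: str):
        """Return the index of the matching entry, or None if absent."""
        raw = name.encode()
        wanted = zlib.crc32(raw)
        for index, entry in enumerate(self.entries):
            if entry.name_hash != wanted:
                continue  # no hash hit: the name bytes are never touched
            self.name_compares += 1
            start = entry.name_offset
            if self.names[start:start + entry.name_len] == raw:
                return index
        return None


block = DirBlock()
block.add("DIRECTORY_ABC")
block.add("notes.txt")
found = block.lookup("notes.txt")
missing = block.lookup("no-such-name")
```

Because the entries are fixed length, the scan steps from one entry to the next without computing any offsets, and the names section is read only on a hash hit.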

Meta-Data

[0043] WAFL keeps information that describes a file
system in files known as meta-data. Meta-data comprises
an inode file, inomap file, and a blkmap file.
WAFL stores its meta-data in files that may be written
anywhere on a disk. Because all WAFL meta-data is
kept in files, it can be written to any location just like any
other file in the file system.
[0044] A first meta-data file is the "inode file" that
contains inodes describing all other files in the file system.
Figure 12 is a diagram illustrating an inode file
1210. The inode file 1210 may be written anywhere on
a disk, unlike prior art systems that write "inode tables"
to a fixed location on disk. The inode file 1210 contains
an inode 1210A-1210F for each file in the file system
except for the inode file 1210 itself. The inode file 1210
is pointed to by an inode referred to as the "root inode".
The root inode is kept in a fixed location on disk referred
to as the file system information (fsinfo) block described
below. The inode file 1210 itself is stored in 4 KB blocks
on disk (or 4 KB buffers in memory). Figure 12 illustrates
that inodes 1210A-1210C are stored in a 4 KB
buffer 1220. For on-disk inode sizes of 128 bytes, a 4
KB buffer (or block) comprises 32 inodes. The incore
inode file 1210 is composed of WAFL buffers 1220.
When an incore inode (i.e., 1210A) is loaded, the on-disk
inode part of the incore inode 1210A is copied in from
the buffer 1220 of the inode file 1210. The buffer data
itself is loaded from disk. Writing data to disk is done in
the reverse order. The incore inode 1210A, which is a
copy of the on-disk inode, is copied to the corresponding
buffer 1220 of the inode file 1210. Then, the inode file
1210 is write-allocated, and the data stored in the buffer
1220 of the inode file 1210 is written to disk.
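
The figure of 32 inodes per block is simply 4096 / 128. The sketch below also shows how an inode could be located inside the inode file, assuming inodes are packed consecutively by inode number; that packing, and the name `locate_inode`, are assumptions made for illustration and are not stated in the text.

```python
BLOCK_SIZE = 4096   # bytes in one 4 KB block of the inode file
INODE_SIZE = 128    # bytes in one on-disk inode
INODES_PER_BLOCK = BLOCK_SIZE // INODE_SIZE


def locate_inode(inode_number: int):
    """Return (block index in the inode file, byte offset inside that block).

    Assumes inodes are stored back to back in inode-number order.
    """
    block_index, slot = divmod(inode_number, INODES_PER_BLOCK)
    return block_index, slot * INODE_SIZE
```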

[0045] Another meta-data file is the "block map"
(blkmap) file. Figure 11A is a diagram illustrating a blkmap
file 1110. The blkmap file 1110 contains a 32-bit
entry 1110A-1110C for each 4 KB block in the disk system.
It also serves as a free-block map file. The blkmap
file 1110 indicates whether or not a disk block has been
allocated. Figure 11B is a diagram of a block entry
1110A of blkmap file 1110 (shown in Figure 11A). As
shown in Figure 11B, entry 1110A is comprised of 32
bits (BIT0-BIT31). Bit 0 (BIT0) of entry 1110A is the
active file system bit (FS-BIT). The FS-bit of entry
1110A indicates whether or not the corresponding block
is part of the active file system. Bits 1-20 (BIT1-BIT20)
of entry 1110A are bits that indicate whether the block is
part of a corresponding snapshot 1-20. The next upper
10 bits (BIT21-BIT30) are reserved. Bit 31 (BIT31) is the
consistency point bit (CP-BIT) of entry 1110A.
[0046] A block is available as a free block in the file
system when all bits (BIT0-BIT31) in the 32-bit entry
1110A for the block are clear (reset to a value of 0).
Figure 11C is a diagram illustrating entry 1110A of Figure
11A indicating the disk block is free. Thus, the block
referenced by entry 1110A of blkmap file 1110 is free when
bits 0-31 (BIT0-BIT31) all have values of 0. Figure 11D
is a diagram illustrating entry 1110A of Figure 11A
indicating an allocated block in the active file system. When
bit 0 (BIT0), also referred to as the FS-bit, is set to a
value of 1, the entry 1110A of blkmap file 1110 indicates
a block that is part of the active file system. Bits 1-20
(BIT1-BIT20) are used to indicate corresponding snapshots,
if any, that reference the block. Snapshots are
described in detail below. If bit 0 (BIT0) is set to a value
of 0, this does not necessarily indicate that the block is
available for allocation. All the snapshot bits must also
be zero for the block to be allocated. Bit 31 (BIT31) of
entry 1110A always has the same state as bit 0 (BIT0)
on disk; however, when loaded into memory, bit 31
(BIT31) is used for bookkeeping as part of a consistency
point.
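
The bit layout of a blkmap entry can be expressed with a few masks. The following Python sketch mirrors the bit positions given above; the helper names are illustrative only.

```python
FS_BIT = 1 << 0                        # BIT0: block is in the active file system
SNAPSHOT_MASK = ((1 << 20) - 1) << 1   # BIT1-BIT20: one bit per snapshot 1-20
CP_BIT = 1 << 31                       # BIT31: consistency point bit


def is_free(entry: int) -> bool:
    """A block is free only when all 32 bits of its entry are clear."""
    return entry == 0


def in_active_fs(entry: int) -> bool:
    return bool(entry & FS_BIT)


def snapshots_referencing(entry: int):
    """Snapshot numbers (1-20) whose bit is set in the entry."""
    return [n for n in range(1, 21) if entry & (1 << n)]
```

An entry whose FS-bit is 0 but which still has a snapshot bit set is therefore not free, which matches the caveat in the paragraph above.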

[0047] Another meta-data file is the "inode map"
(inomap) file that serves as a free inode map. Figure
13A is a diagram illustrating an inomap file 1310. The
inomap file 1310 contains an 8-bit entry 1310A-1310C
for each block in the inode file 1210 shown in Figure 12.
Each entry 1310A-1310C is a count of allocated inodes
in the corresponding block of the inode file 1210. Figure
13A shows values of 32, 5, and 0 in entries 1310A-1310C,
respectively. The inode file 1210 must still be
inspected to find which inodes in the block are free, but
this does not require large numbers of random blocks to be
loaded into memory from disk. Since each 4 KB block
1220 of inode file 1210 holds 32 inodes, the 8-bit
inomap entry 1310A-1310C for each block of inode file
1210 can have values ranging from 0 to 32. When a
block 1220 of an inode file 1210 has no inodes in use,
the entry 1310A-1310C for it in inomap file 1310 is 0.
When all the inodes in the block 1220 of inode file 1210
are in use, the entry 1310A-1310C of the inomap file
1310 has a value of 32.

[0048] Figure 13B is a diagram illustrating an
inomap file 1350 that references the 4 KB blocks
1340A-1340C of inode file 1340. For example, inode file
1340 stores 37 inodes in three 4 KB blocks 1340A-1340C.
Blocks 1340A-1340C of inode file 1340 contain
32, 5, and 0 used inodes, respectively. Entries 1350A-1350C
of inomap file 1350 reference blocks 1340A-1340C
of inode file 1340, respectively. Thus, the entries
1350A-1350C of inomap file have values of 32, 5, and 0
for blocks 1340A-1340C of inode file 1340. In turn,
entries 1350A-1350C of inomap file indicate 0, 27, and
32 free inodes in blocks 1340A-1340C of inode file
1340, respectively.
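
The relationship between the per-block counts and the number of free inodes can be checked with a short Python sketch. It uses the 32, 5, and 0 values from the example above; the function names are hypothetical.

```python
INODES_PER_BLOCK = 32


def free_inodes(inomap: bytes):
    """Free inodes per inode-file block, from the 8-bit allocated counts."""
    return [INODES_PER_BLOCK - used for used in inomap]


def first_block_with_free_inode(inomap: bytes):
    """Index of the first inode-file block worth inspecting, or None."""
    for index, used in enumerate(inomap):
        if used < INODES_PER_BLOCK:
            return index
    return None


example = bytes([32, 5, 0])   # one byte per block of the inode file
```

Only the block returned by `first_block_with_free_inode` needs to be loaded and inspected to find which of its inodes is actually free.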

[0049] Referring to Figure 13, using a bitmap for the
entries 1310A-1310C of inomap file 1310 instead of
counts is disadvantageous since it would require 4
bytes per entry 1310A-1310C for block 1220 of the
inode file 1210 (shown in Figure 12) instead of one byte.
Free inodes in the block(s) 1220 of the inode file 1210
do not need to be indicated in the inomap file 1310
because the inodes themselves contain that information.

[0050] Figure 15 is a diagram illustrating a file system
information (fsinfo) structure 1510. The root inode
1510B of a file system is kept in a fixed location on disk
so that it can be located during booting of the file system.
The fsinfo block is not a meta-data file but is part of
the WAFL system. The root inode 1510B is an inode
referencing the inode file 1210. It is part of the file system
information (fsinfo) structure 1510 that also contains
information 1510A including the number of blocks in the
file system, the creation time of the file system, etc. The
miscellaneous information 1510A further comprises a
checksum 1510C (described below). Except for the root
inode 1510B itself, this information 1510A can be kept
in a meta-data file in an alternate embodiment. Two
identical copies of the fsinfo structure 1510 are kept in
fixed locations on disk.

[0051] Figure 16 is a diagram illustrating the WAFL
file system 1670 in a consistent state on disk comprising
two fsinfo blocks 1610 and 1612, inode file 1620,
blkmap file 1630, inomap file 1640, root directory 1650,
and a typical file (or directory) 1660. Inode file 1620 is
comprised of a plurality of inodes 1620A-1620D that
reference other files 1630-1660 in the file system 1670.
Inode 1620A of inode file 1620 references blkmap file
1630. Inode 1620B references inomap file 1640. Inode
1620C references root directory 1650. Inode 1620D
references a typical file (or directory) 1660. Thus, the
inode file points to all files 1630-1660 in the file system
1670 except for fsinfo blocks 1610 and 1612. Fsinfo
blocks 1610 and 1612 each contain a copy 1610B and
1612B of the inode of the inode file 1620, respectively.
Because the root inode 1610B and 1612B of fsinfo
blocks 1610 and 1612 describes the inode file 1620,
that in turn describes the rest of the files 1630-1660 in
the file system 1670 including all meta-data files 1630-1640,
the root inode 1610B and 1612B is viewed as the
root of a tree of blocks. The WAFL system 1670 uses
this tree structure for its update method (consistency
point) and for implementing snapshots, both described
below.

List of Inodes Having Dirty Blocks

[0052] WAFL in-core inodes (i.e., WAFL inode 1010
shown in Figure 10) of the WAFL file system are maintained
in different linked lists according to their status.
Inodes that reference dirty blocks are kept in a dirty
inode list as shown in Figure 2. Inodes containing valid
data that is not dirty are kept in a separate list and
inodes that have no valid data are kept in yet another, as
is well-known in the art. The present invention utilizes a
list of inodes having dirty data blocks that facilitates finding
all of the inodes that need write allocations to be
done.

[0053] Figure 2 is a diagram illustrating a list 210 of
dirty inodes according to the present invention. The list
210 of dirty inodes comprises WAFL in-core inodes
220-250. As shown in Figure 2, each WAFL in-core
inode 220-250 comprises a pointer 220A-250A, respectively,
that points to another inode in the linked list. For
example, WAFL inodes 220-250 are stored in memory
at locations 2048, 2152, 2878, 3448 and 3712, respectively.
Thus, pointer 220A of inode 220 contains address
2152. It points therefore to WAFL inode 222. In turn,
WAFL inode 222 points to WAFL inode 230 using
address 2878. WAFL inode 230 points to WAFL inode
240. WAFL inode 240 points to inode 250. The pointer
250A of WAFL inode 250 contains a null value and
therefore does not point to another inode. Thus, it is the
last inode in the list 210 of dirty inodes. Each inode in
the list 210 represents a file comprising a tree of buffers
as depicted in Figure 10. At least one of the buffers
referenced by each inode 220-250 is a dirty buffer. A dirty
buffer contains modified data that must be written to a
new disk location in the WAFL system. WAFL always
writes dirty buffers to new locations on disk.

CONSISTENCY POINTS 

[0054] The WAFL disk structure described so far is
static. In the present invention, changes to the file system
1670 are tightly controlled to maintain the file system
1670 in a consistent state. The file system 1670
progresses from one self-consistent state to another
self-consistent state. The set (or tree) of self-consistent
blocks on disk that is rooted by the root inode 1510B is
referred to as a consistency point (CP). To implement
consistency points, WAFL always writes new data to
unallocated blocks on disk. It never overwrites existing
data. Thus, as long as the root inode 1510B is not
updated, the state of the file system 1670 represented
on disk does not change. However, for a file system
1670 to be useful, it must eventually refer to newly written
data, therefore a new consistency point must be
written.
[0055] Referring to Figure 16, a new consistency
point is written by first flushing all file system blocks to
new locations on disk (including the blocks in meta-data
files such as the inode file 1620, blkmap file 1630, and
inomap file 1640). A new root inode 1610B and 1612B
for the file system 1670 is then written to disk. With this
method for atomically updating a file system, the on-disk
file system is never inconsistent. The on-disk file
system 1670 reflects an old consistency point up until
the root inode 1610B and 1612B is written. Immediately
after the root inode 1610B and 1612B is written to disk,
the file system 1670 reflects a new consistency point.
Data structures of the file system 1670 can be updated
in any order, and there are no ordering constraints on
disk writes except the one requirement that all blocks in
the file system 1670 must be written to disk before the
root inode 1610B and 1612B is updated.
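
The effect of writing only to unallocated blocks and switching the root last can be illustrated with a toy model. The Python sketch below is an assumption-laden simplification: the "disk" is a dictionary, the root block is a tuple of data block addresses, and the class and method names are invented for this example.

```python
class ToyDisk:
    """Toy disk: block address -> contents, plus one fixed root location."""

    def __init__(self):
        self.blocks = {}
        self.root = None       # stands in for the root inode in the fsinfo block
        self._next_free = 0

    def write_new(self, contents):
        """Write to a previously unallocated block; never overwrite."""
        addr = self._next_free
        self._next_free += 1
        self.blocks[addr] = contents
        return addr

    def commit_root(self, root_addr):
        """The single ordered step: performed only after all other writes."""
        self.root = root_addr

    def visible_file(self):
        """Contents reachable from the current root."""
        if self.root is None:
            return None
        return [self.blocks[addr] for addr in self.blocks[self.root]]


disk = ToyDisk()
a = disk.write_new("A")
b = disk.write_new("B")
disk.commit_root(disk.write_new((a, b)))
before = disk.visible_file()

b2 = disk.write_new("B-prime")      # modified data goes to a new block
new_root = disk.write_new((a, b2))  # new tree written, but not yet reachable
during = disk.visible_file()        # still the old consistency point
disk.commit_root(new_root)
after = disk.visible_file()         # now the new consistency point
```

Until `commit_root` runs, a reader following the root sees exactly the old tree, however many new blocks have already been written.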
[0056] To convert to a new consistency point, the
root inode 1610B and 1612B must be updated reliably
and atomically. WAFL does this by keeping two identical
copies of the fsinfo structure 1610 and 1612 containing
the root inode 1610B and 1612B. During updating of the
root inode 1610B and 1612B, a first copy of the fsinfo
structure 1610 is written to disk, and then the second
copy of the fsinfo structure 1612 is written. A checksum
1610C and 1612C in the fsinfo structure 1610 and 1612,
respectively, is used to detect the occurrence of a system
crash that corrupts one of the copies of the fsinfo
structure 1610 or 1612, each containing a copy of the
root inode, as it is being written to disk. Normally, the
two fsinfo structures 1610 and 1612 are identical.

Algorithm for Generating a Consistency Point

[0057] Figure 5 is a diagram illustrating the method
of producing a consistency point. In step 510, all "dirty"
inodes (inodes that point to new blocks containing modified
data) in the system are marked as being in the
consistency point. Their contents, and only their contents,
are written to disk. Only when those writes are complete
are any writes from other inodes allowed to reach disk.
Further, during the time dirty writes are occurring, no
new modifications can be made to inodes that are in the
consistency point.

[0058] In addition to setting the consistency point
flag for all dirty inodes that are part of the consistency
point, a global consistency point flag is set so that
user-requested changes behave in a tightly controlled manner.
Once the global consistency point flag is set,
user-requested changes are not allowed to affect inodes that
are in the consistency point. Further, only inodes having
a consistency point flag that is set are allocated disk
space for their dirty blocks. Consequently, the state of
the file system will be flushed to disk exactly as it was
when the consistency point began.
[0059] In step 520, regular files are flushed to disk.
Flushing regular files comprises the steps of allocating
disk space for dirty blocks in the regular files, and writing
the corresponding WAFL buffers to disk. The inodes
themselves are then flushed (copied) to the inode file.
All inodes that need to be written are in either the list of
inodes having dirty buffers or the list of inodes that are
dirty but do not have dirty buffers. When step 520 is
completed, there are no more ordinary inodes in the
consistency point and all incoming I/O requests succeed
unless the requests use buffers that are still locked
up for disk I/O operations.

[0060] In step 530, special files are flushed to disk.
Flushing special files comprises the steps of allocating
disk space for dirty blocks in the two special files: the
inode file and the blkmap file, updating the consistency
bit (CP-bit) to match the active file system bit (FS-bit) for
each entry in the blkmap file, and then writing the blocks
to disk. Write allocating the inode file and the blkmap file is
complicated because the process of write allocating
them changes the files themselves. Thus, in step 530
writes are disabled while changing these files to prevent
important blocks from locking up in disk I/O operations
before the changes are completed.
[0061] Also, in step 530, the creation and deletion
of snapshots, described below, are performed because 
it is the only point in time when the file system, except 
for the fsinfo block, is completely self consistent and 
about to be written to disk. A snapshot is deleted from 
the file system before a new one is created so that the 
same snapshot inode can be used in one pass. 
[0062] Figure 6 is a flow diagram illustrating the
steps that step 530 comprises. Step 530 allocates disk
space for the blkmap file and the inode file and copies
the active FS-bit into the CP-bit for each entry in the blkmap
file. In step 610, the inode for the blkmap file is
pre-flushed to the inode file. This ensures that the block in
the inode file that contains the inode of the blkmap file is
dirty so that step 620 allocates disk space for it.
[0063] In step 620, disk space is allocated for all
dirty blocks in the inode and blkmap files. The dirty
blocks include the block in the inode file containing the
inode of the blkmap file.
[0064] In step 630, the inode for the blkmap file is
flushed again; however, this time the actual inode is written
to the pre-flushed block in the inode file. Step 610
has already dirtied the block of the inode file that contains
the inode of the blkmap file. Thus, another
write-allocate, as in step 620, does not need to be scheduled.
[0065] In step 640, the entries for each block in the
blkmap file are updated. Each entry is updated by copying
the active FS-bit to the CP-bit (i.e., copying bit 0 into
bit 31) for all entries in dirty blocks in the blkmap file.
[0066] In step 650, all dirty blocks in the blkmap and
inode files are written to disk.

[0067] Only entries in dirty blocks of the blkmap file
need to have the active file system bit (FS-bit) copied to
the consistency point bit (CP-bit) in step 640. Immediately
after a consistency point, all blkmap entries have the
same value for both the active FS-bit and CP-bit. As
time progresses, some active FS-bits of blkmap file
entries for the file system are either cleared or set. The
blocks of the blkmap file containing the changed FS-bits
are accordingly marked dirty. During the following
consistency point, blocks that are clean do not need to be
re-copied. The clean blocks are not copied because
they were not dirty at the previous consistency point and
nothing in the blocks has changed since then. Thus, as
long as the file system is initially created with the active
FS-bit and the CP-bit having the same value in all blkmap
entries, only entries in dirty blocks need to be
updated at each consistency point.
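
Step 640 can be sketched as a pass over the dirty blkmap blocks only. In the Python example below each blkmap block is modelled as a list of 32-bit integers and the set of dirty block indices is passed in explicitly; both choices, and the function name, are assumptions made for illustration.

```python
FS_BIT = 1 << 0    # BIT0: active file system bit
CP_BIT = 1 << 31   # BIT31: consistency point bit


def copy_fs_bit_to_cp_bit(blkmap_blocks, dirty_indices):
    """Copy BIT0 into BIT31 for every entry in each dirty blkmap block.

    Clean blocks are skipped: their FS-bit and CP-bit already agree.
    """
    for index in dirty_indices:
        blkmap_blocks[index] = [
            (entry | CP_BIT) if entry & FS_BIT else (entry & ~CP_BIT)
            for entry in blkmap_blocks[index]
        ]


# Block 0 is dirty (bits disagree); block 1 is clean (bits already agree).
blkmap = [
    [FS_BIT, CP_BIT, FS_BIT | (1 << 3)],
    [FS_BIT | CP_BIT, 0],
]
copy_fs_bit_to_cp_bit(blkmap, dirty_indices={0})
```

Snapshot bits are left untouched by the copy; only BIT31 changes.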
[0068] Referring to Figure 5, in step 540, the file
system information (fsinfo) block is updated and then
flushed to disk. The fsinfo block is updated by writing a
new root inode for the inode file into it. The fsinfo block
is written twice. It is first written to one location and then
to a second location. The two writes are performed so
that when a system crash occurs during either write, a
self-consistent file system exists on disk. Therefore,
either the new consistency point is available if the system
crashed while writing the second fsinfo block or the
previous consistency point (on disk before the recent
consistency point began) is available if the first fsinfo
block failed. When the file system is restarted after a
system failure, the highest generation count for a
consistency point in the fsinfo blocks having a correct
checksum value is used. This is described in detail
below.

[0069] In step 550, the consistency point is completed.
This requires that any dirty inodes that were
delayed because they were not part of the consistency
point be requeued. Any inodes that had their state
change during the consistency point are in the consistency
point wait (CP_WAIT) queue. The CP_WAIT
queue holds inodes that changed before step 540 completed,
but after step 510 when the consistency point
started. Once the consistency point is completed, the
inodes in the CP_WAIT queue are re-queued accordingly
in the regular list of inodes with dirty buffers and list
of dirty inodes without dirty buffers.

Single Ordering Constraint of Consistency Point 

[0070] The present invention, as illustrated in Figures
20A-20C, has a single ordering constraint. The single
ordering constraint is that the fsinfo block 1810 is
written to disk only after all the other blocks are written
to disk. The writing of the fsinfo block 1810 is atomic,
otherwise the entire file system 1830 could be lost.
Thus, the WAFL file system requires the fsinfo block
1810 to be written at once and not be in an inconsistent
state. As illustrated in Figure 15, each of the fsinfo
blocks 1810 (1510) contains a checksum 1510C and a
generation count 1510D.

[0071] Figure 20A illustrates the updating of the
generation count 1810D and 1870D of fsinfo blocks
1810 and 1870. Each time a consistency point (or snapshot)
is performed, the generation count of the fsinfo
block is updated. Figure 20A illustrates two fsinfo blocks
1810 and 1870 having generation counts 1810D and
1870D, respectively, that have the same value of N
indicating a consistency point for the file system. Both fsinfo
blocks reference the previous consistency point (old file
system on disk) 1830. A new version of the file system
exists on disk and is referred to as new consistency
point 1831. The generation count is incremented every
consistency point.

[0072] In Figure 20B, the generation count 1810D
of the first fsinfo block 1810 is updated and given a
value of N+1. It is then written to disk. Figure 20B illustrates
a value of N+1 for generation count 1810D of
fsinfo block 1810 whereas the generation count 1870D
of the second fsinfo block 1870 has a value of N. Fsinfo
block 1810 references new consistency point 1831
whereas fsinfo block 1870 references old consistency
point 1830. Next, the generation count 1870D of fsinfo
block 1870 is updated and written to disk as illustrated
in Figure 20C. In Figure 20C, the generation count
1870D of fsinfo block 1870 has a value of N+1. Therefore
the two fsinfo blocks 1810 and 1870 have the same
generation count value of N+1.
[0073] When a system crash occurs between fsinfo
block updates, each copy of the fsinfo block 1810 and
1870 will have a self consistent checksum (not shown in
the diagram), but one of the generation numbers 1810D
or 1870D will have a higher value. A system crash
occurs when the file system is in the state illustrated in
Figure 20B. For example, in the preferred embodiment
of the present invention as illustrated in Figure 20B, the
generation count 1810D of fsinfo block 1810 is updated
before the second fsinfo block 1870. Therefore, the
generation count 1810D (value of one) is greater than the
generation count 1870D of fsinfo block 1870. Because
the generation count of the first fsinfo block 1810 is
higher, it is selected for recovering the file system after
a system crash. This is done because the first fsinfo
block 1810 contains more current data as indicated by
its generation count 1810D. For the case when the first
fsinfo block is corrupted because the system crashes
while it is being updated, the other copy 1870 of the
fsinfo block is used to recover the file system 1830 into
a consistent state.
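
The selection rule, namely the highest generation count among the copies whose checksum is correct, can be sketched as follows. The byte layout, the use of CRC-32 as the checksum, and all function names are assumptions for illustration; the specification does not define them.

```python
import struct
import zlib


def pack_fsinfo(generation: int, root_inode: bytes) -> bytes:
    """Serialize a toy fsinfo copy: generation count, root inode, checksum."""
    body = struct.pack("<Q", generation) + root_inode
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_fsinfo(raw: bytes):
    """Return (generation, root_inode), or None if the checksum is wrong."""
    if len(raw) < 12:
        return None
    body = raw[:-4]
    (stored,) = struct.unpack("<I", raw[-4:])
    if zlib.crc32(body) != stored:
        return None
    (generation,) = struct.unpack("<Q", body[:8])
    return generation, body[8:]


def recover(copies):
    """Pick the valid copy with the highest generation count."""
    valid = [parsed for parsed in map(unpack_fsinfo, copies) if parsed is not None]
    return max(valid, key=lambda parsed: parsed[0]) if valid else None


old = pack_fsinfo(7, b"old-root")
new = pack_fsinfo(8, b"new-root")

# Crash between the two writes (the Figure 20B state): first copy is newer.
between_writes = recover([new, old])

# Crash while writing the first copy: it is corrupted, so the old copy wins.
torn = bytes([new[0] ^ 0xFF]) + new[1:]
during_first_write = recover([torn, old])
```

In both crash cases a self-consistent root is found without any further recovery procedure.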

[0074] It is not possible for both fsinfo blocks 1810
and 1870 to be updated at the same time in the present
invention. Therefore, at least one good copy of the fsinfo
block 1810 and 1870 exists in the file system. This
allows the file system to always be recovered into a
consistent state.

[0075] WAFL does not require special recovery procedures.
This is unlike prior art systems that use logging,
ordered writes, and mostly ordered writes with
recovery. This is because only data corruption, which
RAID protects against, or software can corrupt a WAFL
file system. To avoid losing data when the system fails,
WAFL may keep a non-volatile transaction log of all
operations that have occurred since the most recent
consistency point. This log is completely independent of
the WAFL disk format and is required only to prevent
operations from being lost during a system crash. However,
it is not required to maintain consistency of the file
system.

Generating A Consistency Point 

[0076] As described above, changes to the WAFL
file system are tightly controlled to maintain the file system
in a consistent state. Figures 17A-17H illustrate the
generation of a consistency point for a WAFL file system.
The generation of a consistency point is described
with reference to Figures 5 and 6.

[0077] In Figures 17A-17L, buffers that have not
been modified do not have asterisks beside them.
Therefore, such buffers contain the same data as the
corresponding on-disk blocks. Thus, a block may be loaded
into memory but it has not changed with respect to its
on-disk version. A buffer with a single asterisk (*) beside
it indicates a dirty buffer in memory (its data is modified).
A buffer with a double asterisk (**) beside it indicates
a dirty buffer that has been allocated disk space.
Finally, a buffer with a triple asterisk (***) is a dirty buffer
that is written into a new block on disk. This convention
for denoting the state of buffers is also used with respect
to Figures 21A-21E.

[0078] Figure 17A illustrates a list 2390 of inodes
with dirty buffers comprising inodes 2306A and 2306B.
Inodes 2306A and 2306B reference trees of buffers
where at least one buffer of each tree has been modified.
Initially, the consistency point flags 2391 and 2392
of inodes 2306A and 2306B are cleared (0). While a list
2390 of inodes with dirty buffers is illustrated for the
present system, it should be obvious to a person skilled
in the art that other lists of inodes may exist in memory.
For instance, a list of inodes that are dirty but do not
have dirty buffers is maintained in memory. These
inodes must also be marked as being in the consistency
point. They must be flushed to disk also to write the dirty
contents of the inode file to disk even though the dirty
inodes do not reference dirty blocks. This is done in step
520 of Figure 5.

[0079] Figure 17B is a diagram illustrating a WAFL
file system of a previous consistency point comprising
fsinfo block 2302, inode file 2346, blkmap file 2344 and
files 2340 and 2342. File 2340 comprises blocks 2310-2314
containing data "A", "B", and "C", respectively. File
2342 comprises data blocks 2316-2320 comprising
data "D", "E", and "F", respectively. Blkmap file 2344
comprises block 2324. The inode file 2346 comprises
two 4 KB blocks 2304 and 2306. The second block 2306
comprises inodes 2306A-2306C that reference file
2340, file 2342, and blkmap file 2344, respectively. This
is illustrated in block 2306 by listing the file number in
the inode. Fsinfo block 2302 comprises the root inode.
The root inode references blocks 2304 and 2306 of
inode file 2346. Thus, Figure 17B illustrates a tree of
buffers in a file system rooted by the fsinfo block 2302
containing the root inode.

[0080] Figure 17C is a diagram illustrating two modified
buffers for blocks 2314 and 2322 in memory. The
active file system is modified so that the block 2314
containing data "C" is deleted from file 2340. Also, the data
"F" stored in block 2320 is modified to "F-prime", and is
stored in a buffer for disk block 2322. It should be understood
that the modified data contained in buffers for disk
blocks 2314 and 2322 exists only in memory at this
time. All other blocks in the active file system in Figure
17C are not modified, and therefore have no asterisks
beside them. However, some or all of these blocks may
have corresponding clean buffers in memory.
[0081] Figure 17D is a diagram illustrating the
entries 2324A-2324M of the blkmap file 2344 in memory.
Entries 2324A-2324M are contained in a buffer for 4
KB block 2324 of blkmap file 2344. As described previously,
BIT0 and BIT31 are the FS-BIT and CP-BIT,
respectively. The consistency point bit (CP-BIT) is set
during a consistency point to ensure that the corresponding
block is not modified once a consistency point
has begun, but not finished. BIT1 is the first snapshot bit
(described below). Blkmap entries 2324A and 2324B
illustrate that, as shown in Figure 17B, the 4 KB blocks
2304 and 2306 of inode file 2346 are in the active file
system (FS-BIT equal to 1) and in the consistency point
(CP-BIT equal to 1). Similarly, the other blocks 2310-2312
and 2316-2320 and 2324 are in the active file system
and in the consistency point. However, blocks 2308,
2322, and 2326-2328 are neither in the active file system
nor in the consistency point (as indicated by BIT0
and BIT31, respectively). The entry for deleted block
2314 has a value of 0 in the FS-BIT indicating that it has
been removed from the active file system.
[0082] In step 510 of Figure 5, all "dirty" inodes in
the system are marked as being in the consistency
point. Dirty inodes include both inodes that are dirty and
inodes that reference dirty buffers. Figure 17I illustrates
a list of inodes with dirty buffers where the consistency
point flags 2391 and 2392 of inodes 2306A and 2306B
are set (1). Inode 2306A references block 2314 containing
data "C" of file 2340 which is to be deleted from the
active file system. Inode 2306B of block 2306 of inode
file 2346 references file 2342. Block 2320 containing
data "F" has been modified and a new block containing
data "F-prime" must be allocated. In step 510, the dirty
inodes 2306A and 2306B are copied into the buffer for block
2308. The buffer for block 2306 is subsequently written
to disk (in step 530). This is illustrated in Figure 17E.
The modified data exists in memory only, and the buffer
2308 is marked dirty. The consistency point flags 2391
and 2392 of inodes 2306A and 2306B are then cleared
(0) as illustrated in Figure 17A. This releases the inodes
for use by other processes.

[0083] In step 520, regular files are flushed to disk.
Thus, block 2322 is allocated disk space. Block 2314 of
file 2340 is to be deleted, therefore nothing occurs to
this block until the consistency point is subsequently
completed. Block 2322 is written to disk in step 520.
This is illustrated in Figure 17F where buffers for blocks
2322 and 2314 have been written to disk (marked by
***). The intermediate allocation of disk space (**) is not
shown. The inodes 2308A and 2308B of block 2308 of
inode file 2346 are flushed to the inode file. Inode
2308A of block 2308 references blocks 2310 and 2312
of file 2340. Inode 2308B references blocks 2316, 2318,
and 2322 for file 2342. As illustrated in Figure 17F, disk
space is allocated for block 2308 of inode file 2346 and for
direct block 2322 for file 2342. However, the file system
itself has not been updated. Thus, the file system
remains in a consistent state.

[0084] In step 530, the blkmap file 2344 is flushed
to disk. This is illustrated in Figure 17G where the blkmap
file 2344 is indicated as being dirty by the asterisk.
[0085] In step 610 of Figure 6, the inode for the blkmap
file is pre-flushed to the inode file as illustrated in
Figure 17H. Inode 2308C has been flushed to block
2308 of inode file 2346. However, inode 2308C still
references block 2324. In step 620, disk space is allocated
for blkmap file 2344 and inode file 2346. Block 2308 is
allocated for inode file 2346 and block 2326 is allocated
for blkmap file 2344. As described above, block 2308 of
inode file 2346 contains a pre-flushed inode 2308C for
blkmap file 2344. In step 630, the inode for the blkmap
file 2344 is written to the pre-flushed block 2308C in
inode file 2346. Thus, incore inode 2308C is updated to
reference block 2324 in step 620, and is copied into the
buffer in memory containing block 2306 that is to be
written to block 2308. This is illustrated in Figure 17H
where inode 2308C references block 2326.

[0086] In step 640, the entries 2326A-2326L for
each block 2304-2326 in the blkmap file 2344 are
updated in Figure 17J. Blocks that have not changed
since the consistency point began in Figure 17B have
the same values in their entries. The entries are
updated by copying BIT0 (FS-bit) to the consistency
point bit (BIT31). Block 2306 is not part of the active file
system, therefore BIT0 is equal to zero (BIT0 was
turned off in step 620 when block 2308 was allocated to
hold the new data for that part of the inode file). This is
illustrated in Figure 17J for entry 2326B. Similarly, entry
2326F for block 2314 of file 2340 has BIT0 and BIT31
equal to zero. Block 2320 of file 2342 and block 2324 of
blkmap file 2344 are handled similarly as shown in
entries 2326I and 2326K, respectively. In step 650, dirty
block 2308 of inode file 2346 and dirty block 2326 of
blkmap file 2344 are written to disk. This is indicated in
Figure 17K by a triple asterisk (***) beside blocks 2308 and
2326.
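
The entry update of step 640 can be sketched in Python as follows. This is only an illustration, assuming each blkmap entry is held as a 32-bit integer with BIT0 as the FS-bit and BIT31 as the CP-bit, as the text states; the function name update_cp_bit is hypothetical and does not come from the specification.

```python
FS_BIT = 1 << 0    # BIT0: block belongs to the active file system
CP_BIT = 1 << 31   # BIT31: block belongs to the consistency point


def update_cp_bit(entry: int) -> int:
    """Return the entry with BIT0 (FS-bit) copied into BIT31 (CP-bit)."""
    if entry & FS_BIT:
        return entry | CP_BIT
    # FS-bit is clear, so the CP-bit must be cleared as well.
    return entry & ~CP_BIT & 0xFFFFFFFF


# A block that stays in the active file system keeps both bits set.
kept = update_cp_bit(FS_BIT)
# A block dropped from the active file system (FS-bit already 0)
# ends up with both BIT0 and BIT31 equal to zero.
dropped = update_cp_bit(CP_BIT)
```

The second case corresponds to the entries described for blocks 2306 and 2314, where BIT0 and BIT31 are both zero after the update.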

[0087] Referring to Figure 5, in step 540, the file
system information block 2302 is flushed to disk; this is
performed twice. Thus, fsinfo block 2302 is dirtied and
then written to disk (indicated by a triple asterisk) in
Figure 17L. In Figure 17L, a single fsinfo block 2302 is
illustrated. As shown in the diagram, fsinfo block 2302 now
references blocks 2304 and 2308 of the inode file 2346.
In Figure 17L, block 2306 is no longer part of the inode
file 2346 in the active file system. Similarly, file 2340
referenced by inode 2308A of inode file 2346 comprises
blocks 2310 and 2312. Block 2314 is no longer part of
file 2340 in this consistency point. File 2342 comprises
blocks 2316, 2318, and 2322 in the new consistency
point whereas block 2320 is not part of file 2342.
Further, block 2308 of inode file 2346 references a new
blkmap file 2344 comprising block 2326.

[0088] As shown in Figure 17L, in a consistency
point, the active file system is updated by copying the
inode of the inode file 2346 into fsinfo block 2302.
However, the blocks 2314, 2320, 2324, and 2306 of the
previous consistency point remain on disk. These blocks
are never overwritten when updating the file system to
ensure that both the old consistency point 1830 and the
new consistency point 1831 exist on disk in Figure 20
during step 540.

SNAPSHOTS 

[0089] The WAFL system supports snapshots. A
snapshot is a read-only copy of an entire file system at
a given instant when the snapshot is created. A newly
created snapshot refers to exactly the same disk blocks
as the active file system does. Therefore, it is created in
a small period of time and does not consume any
additional disk space. Only as data blocks in the active file
system are modified and written to new locations on
disk does the snapshot begin to consume extra space.

[0090] WAFL supports up to 20 different snapshots
that are numbered 1 through 20. Thus, WAFL allows the
creation of multiple "clones" of the same file system.
Each snapshot is represented by a snapshot inode that is
similar to the representation of the active file system by
a root inode. Snapshots are created by duplicating the
root data structure of the file system. In the preferred
embodiment, the root data structure is the root inode.
However, any data structure representative of an entire
file system could be used. The snapshot inodes reside
in a fixed location in the inode file. The limit of 20
snapshots is imposed by the size of the blkmap entries.
WAFL requires two steps to create a new snapshot N:
copy the root inode into the inode for snapshot N; and
copy bit 0 into bit N of each blkmap entry in the blkmap
file. Bit 0 indicates the blocks that are referenced by the
tree beneath the root inode.
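
The two steps for creating snapshot N can be sketched as below. This is a simplified model rather than the patented implementation: the FileSystem class, its field names, and the use of a tuple for the root inode are assumptions made for illustration; only the two steps and the 1 through 20 numbering come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class FileSystem:
    root_inode: tuple            # pointers held by the root inode (assumed shape)
    blkmap: list                 # one integer entry per disk block
    snapshot_inodes: dict = field(default_factory=dict)


def create_snapshot(fs: FileSystem, n: int) -> None:
    """Create snapshot n using the two steps described in the text."""
    if not 1 <= n <= 20:
        raise ValueError("snapshot number must be in 1..20")
    # Step 1: copy the root inode into the inode for snapshot n.
    fs.snapshot_inodes[n] = fs.root_inode
    # Step 2: copy bit 0 into bit n of every blkmap entry.
    for i, entry in enumerate(fs.blkmap):
        if entry & 1:
            fs.blkmap[i] = entry | (1 << n)
        else:
            fs.blkmap[i] = entry & ~(1 << n)


fs = FileSystem(root_inode=(10, 11), blkmap=[0b1, 0b0, 0b1])
create_snapshot(fs, 2)
```

No data block is copied in either step, which is why a new snapshot in this model costs only one inode copy and one pass over the blkmap entries.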

[0091] The result is a new file system tree rooted by
snapshot inode N that references exactly the same disk
blocks as the root inode. Setting a corresponding bit in
the blkmap for each block in the snapshot prevents
snapshot blocks from being freed even if the active file
system no longer uses the snapshot blocks. Because WAFL
always writes new data to unused disk locations, the
snapshot tree does not change even though the active
file system changes. Because a newly created snapshot
tree references exactly the same blocks as the root
inode, it consumes no additional disk space. Over time,
the snapshot references disk blocks that would otherwise
have been freed. Thus, over time the snapshot and
the active file system share fewer and fewer blocks, and
the space consumed by the snapshot increases.
Snapshots can be deleted when they consume unacceptable
numbers of disk blocks.

[0092] The list of active snapshots along with the
names of the snapshots is stored in a meta-data file 
called the snapshot directory. The disk state is updated 
as described above. As with all other changes, the 
update occurs by automatically advancing from one 
consistency point to another. Modified blocks are written 
to unused locations on the disk after which a new root 
inode describing the updated file system is written. 

Overview of Snapshots 

[0093] Figure 18A is a diagram of the file system
1830, before a snapshot is taken, where levels of
indirection have been removed to provide a simpler
overview of the WAFL file system. The file system 1830
represents the file system 1690 of Figure 16. The file
system 1830 is comprised of blocks 1812-1820. The
inode of the inode file is contained in fsinfo block 1810.
While a single copy of the fsinfo block 1810 is shown in
Figure 18A, it should be understood that a second copy
of fsinfo block exists on disk. The inode 1810A
contained in the fsinfo block 1810 comprises 16 pointers
that point to 16 blocks having the same level of
indirection. The blocks 1812-1820 in Figure 18A represent all
blocks in the file system 1830 including direct blocks,
indirect blocks, etc. Though only five blocks 1812-1820
are shown, each block may point to other blocks.

[0094] Figure 18B is a diagram illustrating the
creation of a snapshot. The snapshot is made for the entire
file system 1830 by simply copying the inode 1810A of
the inode file that is stored in fsinfo block 1810 into the
snapshot inode 1822. By copying the inode 1810A of
the inode file, a new file of inodes is created
representing the same file system as the active file system.
Because the inode 1810A of the inode file itself is
copied, no other blocks 1812-1820 need to be duplicated.
The copied inode, or snapshot inode 1822, is then copied
into the inode file, which dirties a block in the inode file.
For an inode file comprised of one or more levels of
indirection, each indirect block is in turn dirtied. This
process of dirtying blocks propagates through all the levels
of indirection. Each 4 KB block in the inode file on disk
contains 32 inodes where each inode is 128 bytes long.
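
The figure of 32 inodes per block follows from dividing the 4 KB block size by the 128-byte inode size. The short sketch below works this out and shows how an inode number could be mapped to a block of the inode file; the assumption that inodes are numbered from 0 and packed consecutively, and the helper name locate_inode, are illustrative and not stated in the text.

```python
BLOCK_SIZE = 4096    # bytes per inode-file block (4 KB, from the text)
INODE_SIZE = 128     # bytes per inode (from the text)
INODES_PER_BLOCK = BLOCK_SIZE // INODE_SIZE


def locate_inode(inode_number: int) -> tuple:
    """Return (block index in the inode file, byte offset inside that block)."""
    block_index, slot = divmod(inode_number, INODES_PER_BLOCK)
    return block_index, slot * INODE_SIZE
```

Under this assumed layout, copying one snapshot inode into the inode file dirties exactly one 4 KB block, plus the indirect blocks above it.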

[0095] The new snapshot inode 1822 of Figure 18B
points back to the highest level of indirection blocks
1812-1820 referenced by the inode 1810A of the inode
file when the snapshot 1822 was taken. The inode file
itself is a recursive structure because it contains
snapshots of the file system 1830. Each snapshot 1822 is a
copy of the inode 1810A of the inode file that is copied
into the inode file.

[0096] Figure 18C is a diagram illustrating the
active file system 1830 and a snapshot 1822 when a
change to the active file system 1830 subsequently
occurs after the snapshot 1822 is taken. As illustrated in
the diagram, block 1818 comprising data "D" is modified
after the snapshot was taken (in Figure 18B), and
therefore a new block 1824 containing data "Dprime" is
allocated for the active file system 1830. Thus, the active
file system 1830 comprises blocks 1812-1816 and
1820-1824 but does not contain block 1818 containing
data "D". However, block 1818 containing data "D" is not
overwritten because the WAFL system does not
overwrite blocks on disk. The block 1818 is protected
against being overwritten by a snapshot bit that is set in
the blkmap entry for block 1818. Therefore, the snapshot
1822 still points to the unmodified block 1818 as
well as blocks 1812-1816 and 1820. The present invention,
as illustrated in Figures 18A-18C, is unlike prior art
systems that create "clones" of a file system where a
clone is a copy of all the blocks of an inode file on disk.
Thus, the entire contents of the prior art inode files are
duplicated, requiring large amounts (MB) of disk space
as well as requiring substantial time for disk I/O
operations.

[0097] As the active file system 1830 is modified in
Figure 18C, it uses more disk space because the file
system comprising blocks 1812-1820 is not overwritten.
In Figure 18C, block 1818 is illustrated as a direct block.
However, in an actual file system, block 1818 may be
pointed to by an indirect block as well. Thus, when block
1818 is modified and stored in a new disk location as
block 1824, the corresponding direct and indirect blocks
are also copied and assigned to the active file system
1830.

[0098] Figure 19 is a diagram illustrating the
changes occurring in block 1824 of Figure 18C. Block
1824 of Figure 18C is represented within dotted line
1824 in Figure 19. Figure 19 illustrates several levels of
indirection for block 1824 of Figure 18C. The new block
1910 that is written to disk in Figure 18C is labeled 1910
in Figure 19. Because block 1824 comprises a data
block 1910 containing modified data that is referenced
by double indirection, two other blocks 1918 and 1926
are also modified. The pointer 1924 of single-indirect
block 1918 references new block 1910, therefore block
1918 must also be written to disk in a new location.
Similarly, pointer 1928 of indirect block 1926 is modified
because it points to block 1918. Therefore, as shown in
Figure 19, modifying a data block 1910 can cause
several indirect blocks 1918 and 1926 to be modified as
well. This requires blocks 1918 and 1926 to be written to
disk in a new location as well.

[0099] Because the direct and indirect blocks 1910,
1918 and 1926 of data block 1824 of Figure 18C have
changed and been written to a new location, the inode
in the inode file is written to a new block. The modified
block of the inode file is allocated a new block on disk
since data cannot be overwritten.

[0100] As shown in Figure 19, block 1910 is pointed
to by indirect blocks 1926 and 1918, respectively. Thus
when block 1910 is modified and stored in a new disk
location, the corresponding direct and indirect blocks
are also copied and assigned to the active file system.
Thus, a number of data structures must be updated.
Changing direct block 1910 and indirection blocks 1918
and 1926 causes the blkmap file to be modified.
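
The way a single data-block change forces every indirect block above it into a new location can be sketched as follows. This is a toy model, assuming the disk is a Python dict from block number to either a list of child pointers (an indirect block) or a string (a data block); the function cow_write, the block numbers, and the counter used as an allocator are all hypothetical.

```python
from itertools import count


def cow_write(disk: dict, block: int, path: list, data: str, alloc) -> int:
    """Write data at the leaf reached via path without overwriting any block.

    Returns the number of the newly written block that replaces `block`.
    """
    if not path:
        new_block = next(alloc)
        disk[new_block] = data              # new data block in an unused location
        return new_block
    children = list(disk[block])            # copy the indirect block's pointers
    children[path[0]] = cow_write(disk, children[path[0]], path[1:], data, alloc)
    new_block = next(alloc)
    disk[new_block] = children              # rewritten indirect block, new location
    return new_block


# Double indirection: block 1 points to block 2, which points to data block 3.
disk = {1: [2], 2: [3], 3: "D"}
alloc = count(100)                          # hypothetical source of unused blocks
new_root = cow_write(disk, 1, [0, 0], "Dprime", alloc)
```

After the call, the old chain 1, 2, 3 is untouched and could still be referenced by a snapshot, while the active file system would reference the new chain starting at new_root.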

[0101] The key data structures for snapshots are
the blkmap entries where each entry has multiple bits
for a snapshot. This enables a plurality of snapshots to
be created. A snapshot is a picture of a tree of blocks
that is the file system (1830 of Figure 18). As long as
new data is not written onto blocks of the snapshot, the
file system represented by the snapshot is not changed.
A snapshot is similar to a consistency point.

[0102] The file system of the present invention is
completely consistent as of the last time the fsinfo
blocks 1810 and 1870 were written. Therefore, if power
is interrupted to the system, upon restart the file system
1830 comes up in a consistent state. Because 8-32 MB
of disk space are used in a typical prior art "clone" of a 1
GB file system, clones are not conducive to consistency
points or snapshots as is the present invention.

[0103] Referring to Figure 22, two previous
snapshots 2110A and 2110B exist on disk. At the instant
when a third snapshot is created, the root inode pointing
to the active file system is copied into the inode entry
2110C for the third snapshot in the inode file 2110. At
the same time in the consistency point that goes
through, a flag indicates that snapshot 3 has been
created. The entire file system is processed by checking if
BIT0 for each entry in the blkmap file is set (1) or
cleared (0). All the BIT0 values for each blkmap entry
are copied into the plane for snapshot three. When
completed, every active block 2110-2116 and 1207 in
the file system is in the snapshot at the instant it is
taken.



[0104] Blocks that have existed on disk continuously
for a given length of time are also present in
corresponding snapshots 2110A-2110B preceding the
third snapshot 2110C. If a block has been in the file
system for a long enough period of time, it is present in all
the snapshots. Block 1207 is such a block. As shown in
Figure 22, block 1207 is referenced by inode 2210G of
the active inode file, and indirectly by snapshots 1, 2
and 3.

[0105] The sequential order of snapshots does not
necessarily represent a chronological sequence of file
system copies. Each individual snapshot in a file system
can be deleted at any given time, thereby making an
entry available for subsequent use. When BIT0 of a
blkmap entry that references the active file system is
cleared (indicating the block has been deleted from the
active file system), the block cannot be reused if any of
the snapshot reference bits are set. This is because the
block is part of a snapshot that is still in use. A block can
only be reused when all the bits in the blkmap entry are
set to zero.
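
The reuse rule can be expressed as a simple test on the blkmap entry. The sketch below assumes the entry is an integer whose BIT0 is the active file system bit and whose BIT1 through BIT20 are snapshot bits, as described in the text; the helper names are hypothetical.

```python
FS_BIT = 1 << 0    # BIT0: block is in the active file system


def delete_from_active(entry: int) -> int:
    """Clear BIT0 to record that the active file system no longer uses the block."""
    return entry & ~FS_BIT


def can_reuse(entry: int) -> bool:
    """A block may be reused only when every bit in its blkmap entry is zero."""
    return entry == 0


# Block held by the active file system and by snapshot 3.
shared = delete_from_active(FS_BIT | (1 << 3))
# Block held by the active file system only.
private = delete_from_active(FS_BIT)
```

Here the shared block stays allocated because its snapshot 3 bit is still set, whereas the private block becomes reusable as soon as BIT0 is cleared.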

Algorithm for Generating a Snapshot 

[0106] Creating a snapshot is almost exactly like
creating a regular consistency point as shown in Figure
5. In step 510, all dirty inodes are marked as being in
the consistency point. In step 520, all regular files are
flushed to disk. In step 530, special files (i.e., the inode
file and the blkmap file) are flushed to disk. In step 540,
the fsinfo blocks are flushed to disk. In step 550, all
inodes that were not in the consistency point are
processed. Figure 5 is described above in detail. In fact,
creating a snapshot is done as part of creating a
consistency point. The primary difference between
creating a snapshot and a consistency point is that all
entries of the blkmap file have the active FS-bit copied
into the snapshot bit. The snapshot bit represents the
corresponding snapshot in order to protect the blocks in
the snapshot from being overwritten. The creation and
deletion of a snapshot is performed in step 530 because
that is the only point where the file system is completely
self-consistent and about to go to disk.

[0107] Different steps are performed in step 530
than illustrated in Figure 6 for a consistency point when
a new snapshot is created. The steps are very similar to
those for a regular consistency point. Figure 7 is a flow
diagram illustrating the steps that step 530 comprises
for creating a snapshot. As described above, step 530
allocates disk space for the blkmap file and the inode file
and copies the active FS-bit into the snapshot bit that
represents the corresponding snapshot in order to
protect the blocks in the snapshot from being overwritten.

[0108] In step 710, the inodes of the blkmap file and
the snapshot being created are pre-flushed to disk. In
addition to flushing the inode of the blkmap file to a
block of the inode file (as in step 610 of Figure 6 for a
consistency point), the inode of the snapshot being
created is also flushed to a block of the inode file. This
ensures that the block of the inode file containing the
inode of the snapshot is dirty.

[0109] In step 720, every block in the blkmap file is
dirtied. In step 760 (described below), all entries in the
blkmap file are updated instead of just the entries in
dirty blocks. Thus, all blocks of the blkmap file must be
marked dirty here to ensure that step 730
write-allocates disk space for them.

[0110] In step 730, disk space is allocated for all
dirty blocks in the inode and blkmap files. The dirty
blocks include the block in the inode file containing the
inode of the blkmap file, which is dirty, and the block
containing the inode for the new snapshot.

[0111] In step 740, the contents of the root inode for
the file system are copied into the inode of the snapshot
in the inode file. At this time, every block that is part of
the new consistency point and that will be written to disk
has disk space allocated for it. Thus, duplicating the root
inode in the snapshot inode effectively copies the entire
active file system. The actual blocks that will be in the
snapshot are the same blocks of the active file system.

[0112] In step 750, the inodes of the blkmap file and
the snapshot are copied into the inode file.

[0113] In step 760, entries in the blkmap file are
updated. In addition to copying the active FS-bit to the
CP-bit for the entries, the active FS-bit is also copied to
the snapshot bit corresponding to the new snapshot.

[0114] In step 770, all dirty blocks in the blkmap and
inode files are written to disk.

[0115] Finally, at some time, snapshots themselves
are removed from the file system in step 760. A
snapshot is removed from the file system by clearing its
snapshot inode entry in the inode file of the active file
system and clearing each bit corresponding to the
snapshot number in every entry in the blkmap file. A
count is performed also of each bit for the snapshot in
all the blkmap entries that are cleared from a set value,
thereby providing a count of the blocks that are freed
(corresponding amount of disk space that is freed) by
deleting the snapshot. The system decides which
snapshot to delete on the basis of the oldest snapshots.
Users can also choose to delete specified snapshots
manually.
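
Snapshot removal and the accompanying count can be sketched as follows. The representation of the blkmap as a list of integers and of the snapshot inode entries as a dict is an assumption for illustration, and delete_snapshot is a hypothetical name; the logic follows the description of clearing the snapshot inode entry, clearing the snapshot's bit in every entry, and counting the bits that were cleared from a set value.

```python
def delete_snapshot(blkmap: list, snapshot_inodes: dict, n: int) -> int:
    """Remove snapshot n and return how many entries had its bit cleared."""
    snapshot_inodes.pop(n, None)        # clear the snapshot's inode entry
    mask = 1 << n
    cleared = 0
    for i, entry in enumerate(blkmap):
        if entry & mask:                # bit was set, so it is cleared and counted
            blkmap[i] = entry & ~mask
            cleared += 1
    return cleared


blkmap = [0b101, 0b100, 0b001, 0b000]
snapshot_inodes = {2: "copy of root inode"}
cleared = delete_snapshot(blkmap, snapshot_inodes, 2)
```

In this example two bits are cleared; only the second block ends up with an all-zero entry, because the first is still referenced by the active file system.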

[0116] The present invention limits the total number
of snapshots and keeps a blkmap file that has entries
with multiple bits for tracking the snapshots instead of
using pointers having a COW bit as in Episode. An
unused block has all zeroes for the bits in its blkmap file
entry. Over time, the BIT0 for the active file system is
usually turned on at some instant. Setting BIT0
identifies the corresponding block as allocated in the active
file system. As indicated above, all snapshot bits are
initially set to zero. If the active file bit is cleared before any
snapshot bits are set, the block is not present in any
snapshot stored on disk. Therefore, the block is
immediately available for reallocation and cannot be recovered
subsequently from a snapshot.

Generation of a Snapshot

[0117] As described previously, a snapshot is very
similar to a consistency point. Therefore, generation of a
snapshot is described with reference to the differences
between it and the generation of a consistency point
shown in Figures 17A-17L. Figures 21A-21F illustrate
the differences for generating a snapshot.

[0118] Figures 17A-17D illustrate the state of the
WAFL file system when a snapshot is begun. All dirty
inodes are marked as being in the consistency point in
step 510 and regular files are flushed to disk in step
520. Thus, initial processing of a snapshot is identical to
that for a consistency point. Processing for a snapshot
differs in step 530 from that for a consistency point. The
following describes processing of a snapshot according
to Figure 7.

[0119] The following description is for a second
snapshot of the WAFL file system. A first snapshot is
recorded in the blkmap entries of Figure 17C. As
indicated in entries 2324A-2324M, blocks 2304-2306,
2310-2320, and 2324 are contained in the first
snapshot. All other snapshot bits (BIT1-BIT20) are assumed
to have values of 0 indicating that a corresponding
snapshot does not exist on disk. Figure 21A illustrates
the file system after steps 510 and 520 are completed.

[0120] In step 710, inodes 2308C and 2308D of
snapshot 2 and blkmap file 2344 are pre-flushed to disk.
This ensures that the block of the inode file that is going
to contain the snapshot 2 inode is dirty. In Figure 21B,
inodes 2308C and 2308D are pre-flushed for snapshot
2 and for blkmap file 2344.

[0121] In step 720, the entire blkmap file 2344 is
dirtied. This will cause the entire blkmap file 2344 to be
allocated disk space in step 730. In step 730, disk space
is allocated for dirty blocks 2308 and 2326 for inode file
2346 and blkmap file 2344 as shown in Figure 21C. This
is indicated by a triple asterisk (***) beside blocks 2308
and 2326. This is different from generating a
consistency point where disk space is allocated only for blocks
having entries that have changed in the blkmap file
2344 in step 620 of Figure 6. Blkmap file 2344 of Figure
21C comprises a single block 2324. However, when
blkmap file 2344 comprises more than one block, disk
space is allocated for all the blocks in step 730.

[0122] In step 740, the root inode for the new file
system is copied into inode 2308D for snapshot 2. In
step 750, the inodes 2308C and 2308D of blkmap file
2344 and snapshot 2 are flushed to disk as illustrated in
Figure 21D. The diagram illustrates that snapshot 2
inode 2308D references blocks 2304 and 2308 but not
block 2306.

[0123] In step 760, entries 2326A-2326L in block
2326 of the blkmap file 2344 are updated as illustrated
in Figure 21E. The diagram illustrates that the snapshot
2 bit (BIT2) is updated as well as the FS-BIT and
CP-BIT for each entry 2326A-2326L. Thus, blocks 2304,
2308-2312, 2316-2318, 2322, and 2326 are contained
in snapshot 2 whereas blocks 2306, 2314, 2320, and
2324 are not. In step 770, the dirty blocks 2308 and
2326 are written to disk.

[0124] Further processing of snapshot 2 is identical
to that for generation of a consistency point illustrated in
Figure 5. In step 540, the two fsinfo blocks are flushed
to disk. Thus, Figure 21F represents the WAFL file
system in a consistent state after this step. Files 2340,
2342, 2344, and 2346 of the consistent file system, after
step 540 is completed, are indicated within dotted lines
in Figure 21F. In step 550, the consistency point is
completed by processing inodes that were not in the
consistency point.

Access Time Overwrites 

[0125] Unix file systems must maintain an "access 
time" (atime) in each inode. Atime indicates the last time 
that the file was read. It is updated every time the file is 
accessed. Consequently, when a file is read the block 
that contains the inode in the inode file is rewritten to 
update the inode. This could be disadvantageous for 
creating snapshots because, as a consequence, read- 
ing a file could potentially use up disk space. Further, 
reading all the files in the file system could cause the 
entire inode file to be duplicated. The present invention 
solves this problem. 

[0126] Because of atime, a read could potentially
consume disk space since modifying an inode causes a
new block for the inode file to be written on disk. Further, a
read operation could potentially fail if a file system is full,
which is an abnormal condition for a file system to have
occur.

[0127] In general, data on disk is not overwritten in
the WAFL file system so as to protect data stored on
disk. The only exception to this rule is atime overwrites
for an inode as illustrated in Figures 23A-23B. When an
"atime overwrite" occurs, the only data that is modified
in a block of the inode file is the atime of one or more of
the inodes it contains and the block is rewritten in the
same location. This is the only exception in the WAFL
system; otherwise new data is always written to new
disk locations.

[0128] In Figure 23A, the atimes 2423 and 2433 of
an inode 2422 in an old WAFL inode file block 2420 and
the snapshot inode 2432 that references block 2420 are
illustrated. Inode 2422 of block 2420 references direct
block 2410. The atime 2423 of inode 2422 is "4/30 9:15
PM" whereas the atime 2433 of snapshot inode 2432 is
"5/1 10:00 AM". Figure 23A illustrates the file system
before direct buffer 2410 is accessed.

[0129] Figure 23B illustrates the inode 2422 of
direct block 2410 after direct block 2410 has been
accessed. As shown in the diagram, the access time
2423 of inode 2422 is overwritten with the access time
2433 of snapshot 2432 that references it. Thus, the
access time 2423 of inode 2422 for direct block 2410 is
"5/1 11:23 AM".

[0130] Allowing inode file blocks to be overwritten
with new atimes produces a slight inconsistency in the
snapshot. The atime of a file in a snapshot can actually
be later than the time that the snapshot was created. In
order to prevent users from detecting this inconsistency,
WAFL adjusts the atime of all files in a snapshot to the
time when the snapshot was actually created instead of
the time a file was last accessed. This snapshot time is
stored in the inode that describes the snapshot as a
whole. Thus, when accessed via the snapshot, the
access time 2423 for inode 2422 is always reported as
"5/1 10:00AM". This occurs both before the update,
when it may be expected to be "4/30 9:15PM", and after
the update, when it may be expected to be "5/1
11:23AM". When accessed through the active file
system, the times are reported as "4/30 9:15PM" and "5/1
11:23AM" before and after the update, respectively.
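
The reporting rule for access times can be sketched as below. The Inode and Snapshot classes and the function reported_atime are hypothetical names introduced for illustration; the times are the ones used in the example above, kept as plain strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Inode:
    atime: str       # last access time stored in the inode-file block


@dataclass
class Snapshot:
    created: str     # creation time stored in the snapshot's own inode


def reported_atime(inode: Inode, snapshot: Optional[Snapshot] = None) -> str:
    """Report the atime seen by a user, depending on the access path."""
    if snapshot is not None:
        return snapshot.created      # via a snapshot: always its creation time
    return inode.atime               # via the active file system: the real atime


inode = Inode(atime="4/30 9:15PM")
snap = Snapshot(created="5/1 10:00AM")
before = (reported_atime(inode), reported_atime(inode, snap))
inode.atime = "5/1 11:23AM"          # atime overwrite in place after a read
after = (reported_atime(inode), reported_atime(inode, snap))
```

The value seen through the snapshot does not change across the overwrite, which is how the in-place atime update stays invisible to snapshot users in this model.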

[0131] In this manner, a method is disclosed for
maintaining a file system in a consistent state and for
creating read-only copies of the file system.

Claims 

1. A method for generating a consistency point
comprising the steps of:

marking (510) a plurality of inodes, an inode
being a file definition structure describing at
least one file in a file system, pointing to a
plurality of modified blocks in a file system as
being in a consistency point;

flushing (520) regular files and meta-data files
(530) to storage means;

flushing (540) at least one block of file system
information to said storage means; and

requeueing (550) any dirty inodes that were not
part of said consistency point.

2. The method of claim 1 wherein said step of flushing
said meta-data files to said storage means further
comprises the steps of:

pre-flushing (610) an inode for blockmap file to
an inode file;

allocating (620) space on said storage means
for all dirty blocks in said inode and said
blockmap files;

flushing (630) said inode for said blockmap file
again;

updating (640) a plurality of entries in said
blockmap file wherein each entry of said
plurality of entries represents a block on said
storage means; and

writing (650) all dirty blocks in said blockmap
file and said inode file to said storage means.

Claims (Patentansprüche)

1. A method for generating a consistency point,
comprising the steps of:

marking (510) a plurality of inodes, an inode
being a file definition structure that describes at
least one file in a file system, which point to
several modified blocks in a file system, as
being in a consistency point;

flushing (520) regular files and meta-data
files (530) to a storage device;

flushing (540) at least one block of file system
information to the storage device; and

re-queueing (550) any dirty inodes that were
not part of the consistency point.

2. The method according to claim 1, in which the step
of flushing meta-data files to the storage device
further includes the following steps:

pre-flushing (610) an inode from a blockmap
file into an inode file;

allocating (620) space on the storage device
for all dirty blocks in the inode and the
blockmap files;

flushing (630) the inode for the blockmap file
again;

updating (640) a plurality of entries in the
blockmap file, each entry among the several
entries representing a block on the storage
device; and

writing (650) all dirty blocks in the blockmap
file and the inode file to the storage device.

Claims (Revendications)

1. A method for generating a consistency point,
comprising the steps consisting of:

marking (510) a plurality of inodes, an inode
being a file definition structure describing at
least one file in a file system, pointing to a
plurality of modified blocks in a file system, as
being in a consistency point,

flushing (520) regular files and meta-data
files (530) to storage means,

flushing (540) at least one block of file system
information to said storage means, and

re-queueing (550) any dirty inode that was not
part of said consistency point.

2. The method according to claim 1, in which said
step of flushing said meta-data files to said
storage means further comprises the steps
consisting of:

pre-flushing (610) an inode of a blockmap file
to an inode file,

allocating (620) space in said storage means
to all dirty blocks located in said inode and
blockmap files,

flushing (630) said inode of said blockmap
file again,

updating (640) a plurality of entries in said
blockmap file, in which each entry of said
plurality of entries represents a block in said
storage means, and

writing (650) all dirty blocks in said blockmap
file and said inode file to said storage means.